Do you know how email filters pick out spam? Or how, when you post on social media, the platform automatically tags your post's topic? The answer is text classification, and it is usually done automatically by machines. That said, as Facebook has been criticized for in recent years, it is widely known that content moderation for Taiwan is actually handled by reviewers on the other side of the strait, so there is some gap with Taiwanese users' expectations, but this is hard to avoid.
Back to text classification. Simply put, text classification means teaching a machine to sort texts or articles into categories. Think of your Gmail: text classification is like a personal assistant that files your mail under "work", "family", "spam", and so on. That way you will not accidentally delete an important work email, or click on a scam message by mistake.
Beyond email and social media, text classification has many other uses, such as news categorization, sentiment analysis, and even automated customer service. So the next time a machine seems to "understand" what you wrote, there is a good chance text classification is quietly doing the work behind the scenes!
First, load the packages.
library(tidyverse)  # data wrangling, ggplot2, stringr, forcats
library(caret)      # confusionMatrix() and other evaluation helpers
library(rsample)    # initial_split() for train/test splitting
library(tidytext)   # unnest_tokens(), cast_dtm(), tidy()
library(tm)         # removeSparseTerms() for the document-term matrix
Next, load the data.
### load data
df_all_clean <- read_rds("data/df_all_clean.rds")               # cleaned articles with id, text, and label
df_text_seg_unnest <- read_rds("data/df_text_segj_unnest.rds")  # one row per token, with POS tags
df_stopwords <- read_rds("data/stopWords.rds")                  # stopword list (column: word)
df_sentiment <- read_rds("data/sentiment/df_sentiment.rds")     # sentiment dictionary (columns: word, type)
Among these files are a stopword list and a sentiment dictionary; both are used below for tagging the text and building features.
### deal with label
df_all_clean <- df_all_clean %>%
mutate(label = as.character(label)) %>%
filter(label %in% c("1", "-1")) %>%
mutate(label = if_else(label == "-1", "N", "Y")) %>%
mutate(label = fct_relevel(as.factor(label), "Y", "N"))
The original labels were 1 and -1; they are now recoded as Y (for 1) and N (for -1), with Y placed first so it is treated as the positive class later. Next, we remove the stopwords.
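Before going further, it is worth a quick look at the class balance. A minimal check, assuming df_all_clean now carries the recoded label column:
df_all_clean %>% count(label) %>% mutate(prop = n / sum(n))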
### stopwords
df_news_seg_clean <- df_text_seg_unnest %>% filter(text_POS != "FW") %>%
  bind_rows(
    # foreign-word (FW) segments are tokenized one step further
    df_text_seg_unnest %>% filter(text_POS == "FW") %>%
      unnest_tokens(text_segment, text_segment)
  ) %>%
  filter(!str_detect(text_segment, "[a-zA-Z0-9]+")) %>%     # drop tokens containing letters or digits
  filter(!str_detect(text_POS, "space|ther")) %>%           # drop whitespace / "other"-type POS tags
  filter(!str_detect(text_segment, "「|」|【|】|/")) %>%       # drop leftover punctuation
  anti_join(df_stopwords, by = c("text_segment" = "word"))  # remove stopwords
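After the cleaning above, a quick frequency check helps confirm that stopwords and noise really are gone; a small sketch:
df_news_seg_clean %>% count(text_segment, sort = TRUE) %>% slice_head(n = 20)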
### sentiment data
df_news_sentiment <- df_news_seg_clean %>% select(id, text_segment) %>%
  left_join(df_sentiment %>% rename(text_segment = word), by = "text_segment") %>%
  filter(!is.na(type)) %>% mutate(score = if_else(type == "pos", 1, -1)) %>%  # +1 for positive words, -1 for negative
  mutate(score_abs = abs(score)) %>%
  group_by(id) %>% summarise(score = sum(score), score_abs = sum(score_abs))  # net sentiment and number of sentiment words per document
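The per-document score is simply the number of positive words minus the number of negative words, while score_abs counts all matched sentiment words. A rough look at the distribution (a sketch, using ggplot2 from the tidyverse):
df_news_sentiment %>% ggplot(aes(score)) + geom_histogram(binwidth = 1)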
Next, we join in the part-of-speech (POS) counts.
### POS wide data
df_pos_wide <- df_news_seg_clean %>% select(id, text_POS = text_POS2) %>%
  mutate(text_POS = str_c("pos_", text_POS)) %>%  # prefix feature names with "pos_"
  count(id, text_POS) %>%                         # POS-tag counts per document
  pivot_wider(names_from = text_POS, values_from = n, values_fill = list(n = 0))
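To see which POS features were just created, you can list the new columns; a quick check:
df_pos_wide %>% select(starts_with("pos_")) %>% colnames()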
Next, keep only the words that appear at least three times in the corpus.
# keep words that appear at least three times
df_news_seg_count <- df_news_seg_clean %>%
count(text_segment, sort = T) %>%
filter(n >= 3) %>%
select(text_segment)
df_news_seg_kick <- df_news_seg_clean %>%
count(id, text_segment) %>%
inner_join(df_news_seg_count, by = "text_segment") %>%
left_join(df_all_clean, by = "id") ### join back the original article data
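As a sanity check, you can count how many distinct terms survive the frequency filter and how many documents still have at least one kept term; a small sketch:
nrow(df_news_seg_count)                        # terms kept
df_news_seg_kick %>% distinct(id) %>% nrow()   # documents that remain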
Convert to a document-term matrix (DTM). We then call removeSparseTerms() with a threshold of 0.997, which drops every term that is missing from more than 99.7% of documents; in other words, only terms appearing in at least roughly 0.3% of the documents are kept.
dtm_news <- df_news_seg_kick %>%
cast_dtm(id, text_segment, n)
# remove sparse words
dtm_news_f <- removeSparseTerms(dtm_news, 0.997)
dim(dtm_news);dim(dtm_news_f)
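To verify what removeSparseTerms() kept, you can compute each retained term's document frequency from the tidied DTM; a sketch:
tidy(dtm_news_f) %>% count(term, name = "doc_freq") %>% arrange(doc_freq) %>% head()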
Then convert it back to a data frame.
# convert to dataframe
df_news_dtm <- tidy(dtm_news_f) %>% pivot_wider(id_cols = document ,
names_from = term,
values_from = count,
values_fill = list(count=0)) %>%
rename(id = document) %>% mutate(id = as.integer(id))
df_news_dtm
### all in one
# 1. join the feature tables: POS_wide, sentiment, DTM
# 2. the label has been processed already: only the +1 / -1 documents were kept
df_word_feature <- df_all_clean %>%
  # join the feature tables
  left_join(df_pos_wide, by = "id") %>%
  left_join(df_news_sentiment, by = "id") %>%
  left_join(df_news_dtm, by = "id") %>%
  mutate_at(vars(matches("pos_")), ~ if_else(is.na(.), as.integer(0), .)) %>%           # missing POS counts become 0
  mutate_at(vars(matches("score")), ~ if_else(is.na(.), as.integer(0), as.integer(.)))  # missing sentiment scores become 0
Time to train the model!
### modeling
# splitting and modeling
set.seed(999)
split_set <- initial_split(df_word_feature, strata = label, prop = 0.7)
train_data <- training(split_set)
test_data <- testing(split_set)
# preview the feature columns used for modeling (drop the id and the raw text)
train_data %>% select(-id, -text)
# fit a logistic regression with glm()
# note: glm() treats the FIRST level of a factor response as "failure", so the label
# is releveled here (N first) so that the fitted probability is P(label == "Y")
fit_glm <- glm(label ~ .,
               data = train_data %>% select(-id, -text) %>% mutate(label = fct_relevel(label, "N")),
               family = "binomial")
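After fitting, it can be informative to look at which features receive the strongest coefficients; a sketch, using broom (installed with the tidyverse but not attached by default):
broom::tidy(fit_glm) %>% filter(term != "(Intercept)") %>% arrange(p.value) %>% slice_head(n = 10)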
# predict and convert probabilities to binary labels
test_data$log_score <- predict(fit_glm, newdata = test_data, type = "response")  # P(label == "Y")
test_data$log_output <- factor(ifelse(test_data$log_score > 0.5, "Y", "N"),
                               levels = levels(test_data$label))
test_data %>% count(label, log_output)
# Evaluation Metrics
log.result <- confusionMatrix(data = test_data$log_output, test_data$label); log.result$table
log.precision <- log.result$byClass['Pos Pred Value']; log.precision #0.6497065
log.recall <- log.result$byClass['Sensitivity']; log.recall #0.5865724
log.F1 <- log.result$byClass['F1']; log.F1 #0.6165274
log.acc <- log.result$overall['Accuracy']; log.acc #0.5824065
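The metrics above depend on the 0.5 cut-off; to summarize ranking quality independently of the threshold, you can also compute an AUC. A sketch, assuming the pROC package is installed:
library(pROC)
roc_glm <- roc(response = test_data$label, predictor = test_data$log_score,
               levels = c("N", "Y"), direction = "<")
auc(roc_glm)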
And that is a complete text-classification model!